Posts on central websites need less originality to be noticed

Information has major consequences for democracy and society. It is important to understand what factors favor its diffusion. The impact of the content of a message on its likelihood of going viral is poorly understood. Some studies say originality is important for a message not to be overlooked. Others give more relevance to paratextual elements—network centrality, timing, human cognitive limits. Here we propose that originality and centrality interact in a nontrivial way, which might explain why originality by itself is not a good predictor of success. We collected data from Reddit on users sharing hyperlinks. We estimated the originality of each post title and the centrality of the website hosting the shared link. We show that the interaction effect exists: if users share content from a central website, originality no longer increases the odds of receiving at least one upvote. The same is not true for the odds of becoming one of the top 10% scoring posts. We show that originality is concentrated in the domain network: domains in the core of the network produce more original content. Our results imply that research on online information virality needs to take into account the nontrivial interaction between originality and prominence.


Overview of Robustness Checks
In this section we examine the β_3 coefficients of Equation 1 in the main paper across all the robustness checks we performed. The aim is to show that the β_3 coefficients related to RQ1 are, with one exception discussed below, negative and significant, confirming the interaction between originality and centrality when it comes to gathering the first upvote. In contrast, the β_3 coefficients related to RQ2 show no consistent pattern, indicating that there is no clear relationship between originality and centrality for top-scoring posts.
In all the tables that follow (here and in the next sections), we mark the statistical significance using a classical star notation: * for p < 0.1, ** for p < 0.05, and *** for p < 0.01.
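The star notation above can be written as a small lookup. This is a minimal sketch; the function name `significance_stars` is our own and does not come from the paper.

```python
def significance_stars(p_value: float) -> str:
    """Return the star marker used in the tables for a given p-value.

    Thresholds follow the text: * for p < 0.1, ** for p < 0.05, *** for p < 0.01.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must lie in [0, 1]")
    if p_value < 0.01:
        return "***"
    if p_value < 0.05:
        return "**"
    if p_value < 0.1:
        return "*"
    return ""  # not significant at the 10% level
```

For example, a coefficient with p = 0.03 would be marked with two stars.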

Table 2. All β_3 coefficients for all robustness checks we ran for RQ2. * p<0.1; ** p<0.05; *** p<0.01
Table 1 shows all β_3 coefficients for RQ1, as we vary: the definition of centrality (indegree, betweenness, pagerank, or number of posts per domain), the model we run (Linear Probability Model (LPM) with random effects versus logit with fixed effects), the observation period (December 2019 vs November 2019 to January 2020), the definition of originality (the simple version we use in the paper vs the one based on information entropy), the definition of success (upvotes vs comments), and whether or not we filter underused subreddits and domains. Section 3 contains further details for each of these robustness checks. Note that for analyses with extended data (additional months, subreddits or domains), the LPM became computationally intractable, meaning we relied on the logit model instead.
We can see that, with one exception, all of these interaction coefficients are negative and significant, confirming our discussion in the main paper. The exception occurs when we use number of posts as a measure of domain centrality and comments as the criterion of success. Two factors explain this exception. First, ranking domains by number of posts differs significantly from all other ranking criteria (as we show in the paper), since it does not use any network notion of centrality. Second, comments are a much stricter criterion for success than upvotes: getting one comment is a much more demanding threshold than getting one upvote. Thus in this case RQ1 is stretched, because getting one comment is not the minimum threshold of attention in the same sense as getting one upvote. While this specific finding may be interesting to investigate in future work, it is not especially problematic for our main result.
Table 2 shows all β_3 coefficients for RQ2. The two bottom rows report coefficients for robustness checks specific to RQ2. In these tests we move the threshold of what we consider a "top scoring" post. Rather than using our default threshold of being in the top 10% scoring posts in terms of upvotes, in these two tests we set the threshold at 7.5% and 12.5%, respectively. Across all tests, the coefficients vary between being statistically non-significant, positive, and negative. This means our answer to RQ2 is inconclusive.

Fixed Effects Logit Model
In this section we discuss the relationship between the simple and intuitive LPM with random effects and the more appropriate but less intuitive logit model with fixed effects. We see that the two models never result in different signs on the interaction coefficients, supporting our decision to use the simple LPM to build an intuition of our results as we do in the main paper.
Concerning RQ1, this view is supported by the first two rows of Table 1, as well as Tables 3 and 4, which contain the results of the regression for the p(>1) variable: the likelihood of getting at least one upvote, i.e. of not failing. Consistent with what we discuss in the main paper, all these coefficients are negative and significant, showing that the more central a platform is, the lower the impact of originality.
The first two rows of Table 2, as well as Tables 5 and 6, contain the results of the regression for the p(>10%) variable: the likelihood of being in the top 10% scoring posts, i.e. of succeeding. These tables inform our discussion of RQ2 in the main paper. Consistent with what we discuss in the main paper, we do not see a clear negative coefficient for β_3: if anything, we find evidence of a positive coefficient, although this is not robust across all models and centrality definitions.
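To make the role of the interaction coefficient concrete, the sketch below assumes that Equation 1 contains the terms β_1·O_T + β_2·C_d + β_3·O_T·C_d (the exact specification is in the main paper and is not reproduced here). Under that assumption, the slope of originality in the LPM is β_1 + β_3·C_d, so a negative β_3 shrinks the effect of originality as centrality grows. The coefficient values are hypothetical and chosen only for illustration.

```python
def originality_slope(beta_1: float, beta_3: float, centrality: float) -> float:
    """Marginal effect of originality on the outcome in a linear model with an
    originality-centrality interaction (assumed form: beta_1 + beta_3 * C_d)."""
    return beta_1 + beta_3 * centrality


# Hypothetical coefficients, NOT estimates from the paper.
BETA_1 = 0.10
BETA_3 = -0.02

for c in (0.0, 2.0, 5.0, 10.0):
    slope = originality_slope(BETA_1, BETA_3, c)
    print(f"centrality={c:>4}: originality slope = {slope:+.3f}")
```

With these illustrative numbers the slope is positive for peripheral domains and turns negative for the most central ones, which is the qualitative pattern a negative β_3 describes.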

Alternative Centrality
The results from the main paper are not dependent on our choice of centrality measure C_d. Here, we replace the indegree with betweenness centrality, PageRank, and number of posts from a specific domain. For RQ1, the sign of β_3 in all these cases is still negative and significant for both the LPM with random effects and the conditional maximum likelihood (CMLE) logit model (both with p < 0.01); see the first two rows of Table 1. Figure 1 reproduces Figure 1 from the main paper, with each row focusing on a different centrality measure. The patterns we observe in all the figures are the same. The difference between high and low centrality platforms is particularly evident when using PageRank as a centrality measure: the probability of getting one upvote shrinks for high centrality sources from 19.5% to 15.8% as originality increases from its minimum to its maximum value, while it increases for low centrality sources from 11.2% to 20.5%. Among these measures, the number of posts behaves most differently, as it is the least correlated with the other measures, given that it is not a network centrality indicator. However, the effect still holds and it is significant.
With the exception of post counts, the alternative centrality measures also confirm the null result for RQ2, which focuses on p(>10%); see also the first two rows of Table 2. The coefficients for these alternative measures of centrality are shown in all full regression tables we include in this document for the Reddit dataset, from Table 3 to Table 6.
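The PageRank figures quoted above translate into percentage-point changes as follows. This is plain arithmetic on the four probabilities reported in the text; the helper name is ours.

```python
def originality_effect(p_at_min: float, p_at_max: float) -> float:
    """Change in p(>1), in percentage points, as originality goes from its
    minimum to its maximum value."""
    return round(p_at_max - p_at_min, 1)


# Probabilities (in %) reported in the text for PageRank centrality.
high_centrality = originality_effect(19.5, 15.8)
low_centrality = originality_effect(11.2, 20.5)

print(f"high centrality: {high_centrality:+.1f} points")
print(f"low centrality:  {low_centrality:+.1f} points")
```

The two changes have opposite signs: originality costs high centrality sources a few points, while it gains low centrality sources several.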

Alternative Originality
In our main analyses, we use a relatively simple definition of originality, O_T. Here, we create an originality measure inspired by information entropy [1]. The probability p(B_x) of each bigram B_x is first adjusted by the length of the post: p*(B_x) = p(B_x)/(|T| − 1). Then, we sum over the contributions following Shannon's information entropy as ∑ p*(B_x) log_2 p*(B_x). Note that this leaves some leftover, because ∑ p(B_x)/(|T| − 1) < 1. We use λ_T = 1 − ∑ p(B_x)/(|T| − 1) to indicate this information leftover and we add it to the sum to obtain the information entropy IE_T. If all bigram probabilities p(B_x) are high and roughly equal to each other, the leftover will be small and the information entropy IE_T will be high. IE_T is inversely proportional to originality. We also need to adjust it according to the length of a post. The resulting measure should capture the originality of a post better than our naive main measure, because it better quantifies the amount of information (i.e. surprise) in a text by using an information-theoretic measure, rather than relying on the most basic framework (Naive Bayes). However, it is less intuitive than our simpler measure.
Results for RQ1 look the same when we use this information-theoretic originality definition. Figure 2(a) shows the interaction effect of this new O_T variable with centrality for p(>1). The negative interaction (decreasing originality slope as centrality goes up) is in line with the effect from the original variable we showed in Figure 1 in the main paper. In Figure 2(b), which shows the model for p(>10%), we instead see a very small but statistically significant positive interaction.
In Tables 7 and 8, as well as in the third row of Tables 1 and 2, we see the coefficients for the p(>1) and p(>10%) regressions when using the alternative O_T originality definition based on information-theoretic entropy. We confirm the solid negative β_3 coefficient for p(>1), and the small positive coefficients for p(>10%).
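The building blocks of this measure can be sketched as below. The sketch computes only the three quantities the text defines explicitly: the length-adjusted probabilities p*(B_x), the sum ∑ p*(B_x) log_2 p*(B_x), and the leftover λ_T. It does not attempt the final combination or the length adjustment, whose formulas are not given in this text. We assume the bigram probabilities p(B_x) are already estimated (how is described in the main paper) and that a title with |T| tokens yields |T| − 1 bigrams; the function name is ours.

```python
import math


def entropy_components(bigram_probs):
    """Return (adjusted probabilities, sum of p* log2 p*, leftover lambda_T).

    `bigram_probs` holds p(B_x) for each of the |T| - 1 bigrams of one title.
    """
    n_bigrams = len(bigram_probs)  # assumed to equal |T| - 1
    if n_bigrams == 0:
        raise ValueError("a title needs at least two tokens (one bigram)")
    # p*(B_x) = p(B_x) / (|T| - 1)
    adjusted = [p / n_bigrams for p in bigram_probs]
    # Sum of p* log2 p*; zero-probability bigrams are skipped (our assumption).
    weighted_sum = sum(p * math.log2(p) for p in adjusted if p > 0)
    # lambda_T = 1 - sum of p*
    leftover = 1.0 - sum(adjusted)
    return adjusted, weighted_sum, leftover


# Toy title with two bigrams and made-up probabilities.
adjusted, weighted_sum, leftover = entropy_components([0.5, 0.25])
print(adjusted, weighted_sum, leftover)
```

Because every p(B_x) is at most 1, the adjusted probabilities sum to at most 1, so the leftover is never negative.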

Alternative Success
Upvotes might not be the right measure of success: an alternative view is that a successful post is one that generates a discussion. In this sense, what matters is not the number of upvotes, but the number of comments. Thus, in this section we change our p(>1) and p(>10%) variables. Rather than being the probability of getting one upvote and being in the top 10% upvoted posts, they are now the probability of getting at least one comment or being in the top 10% posts in number of comments.
Figure 3 reproduces the results from Figure 1 in the main paper by changing the target variable from upvotes to comments. The result for p(>1) is maintained when switching from an upvote-based definition of success to a comment-based one. For the p(>1) regression, the size of the interaction effect is smaller, but the β_3 coefficient is still negative and significant, confirming the answer to our first research question. Interestingly, the negative interaction now also holds (weakly) for p(>10%).
A possible explanation as to why the size of the effect is reduced is the rarity of comments relative to upvotes: receiving one additional comment is a rare occurrence, and it is more similar to being among the top posts than to merely receiving one upvote. In fact, while you need more than 200 upvotes to be in the top 10% scoring posts, you only need 8 comments to rank similarly.
In the fourth row of Tables 1 and 2, and in Tables 9 and 10, we see the coefficients for the p(>1) and p(>10%) regressions when using comments as the basis of a post's success, rather than upvotes. The coefficients confirm the conclusions above, with the exception of the model using the number of posts as the platform's centrality (as discussed in section 2).

Consistency Across Months
We obtain the main results for our paper from the December 2019 data on Reddit. As a robustness check, we test whether the results hold for a longer period of time. We replicate the main CMLE logit regression using data from November 2019 to January 2020. Enlarging the time window further introduces too many subreddits, and we cannot fit the random effects LPM in the memory we have available; this is also why we run the logit model here rather than the LPM used in the other robustness checks. Since we cannot fit this LPM, we keep only December 2019 for the main result and limit this analysis to a robustness check. The fifth row of Tables 1 and 2, as well as Tables 11 and 12, show the results for RQ1 and RQ2 with this extended dataset. In both cases, we can confirm our answers from the main analysis. The β_3 coefficients are negative and significant for p(>1), confirming the negative centrality-originality interaction for RQ1. For RQ2, we confirm the inconclusiveness of our results.

Infrequent Domains/Subreddits
One of the cleaning steps we take in the main analysis is removing all domains and subreddits that have not generated at least 5 posts during the observation period. This is done under the assumption that sparsely used domains and subreddits might be dominated by fluctuations and thus add noise to our estimates. However, this has the potential of ignoring the periphery in the website network, and thus we should make sure that we are not biasing our results.
For this reason, in this section we replicate the main LPM regression without excluding the underused domains and subreddits. The sixth row of Tables 1 and 2, along with Tables 13 and 14, report the results. Once again, all results for both RQ1 and RQ2 are consistent with what we have discussed so far.
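The cleaning step that this check switches off can be sketched as follows. This is a minimal illustration, assuming each post is represented as a (domain, subreddit) pair and that counts are computed once over the whole observation period; the paper's actual pipeline may differ (for instance, it might re-count after each removal). All names are ours.

```python
from collections import Counter


def filter_infrequent(posts, min_posts=5):
    """Keep posts whose domain AND subreddit both have at least `min_posts` posts.

    `posts` is a list of (domain, subreddit) tuples. Counts are taken once,
    before any removal (a single-pass assumption).
    """
    domain_counts = Counter(domain for domain, _ in posts)
    subreddit_counts = Counter(subreddit for _, subreddit in posts)
    return [
        (domain, subreddit)
        for domain, subreddit in posts
        if domain_counts[domain] >= min_posts
        and subreddit_counts[subreddit] >= min_posts
    ]


# Toy data: one domain with five posts, one with a single post.
toy_posts = [("a.example", "news")] * 5 + [("b.example", "news")]
kept = filter_infrequent(toy_posts)
print(len(kept), "of", len(toy_posts), "posts kept")
```

Setting `min_posts=1` keeps every post, which corresponds to the unfiltered data used in this robustness check.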

RQ2 Threshold
For RQ2 we want to focus on the most successful posts on Reddit. We thus need a threshold establishing which posts are among the top performers. For the main paper, we decide that the 10% top scoring posts are part of this set. This choice is motivated by the fact that we cannot restrict this set too much, because otherwise the logit regression would not provide accurate results if the outcome variable is mostly zero. On the other hand, we cannot make the set too large, otherwise it would overlap too much with the p(>1) set, making RQ2 indistinguishable from RQ1.
To show that the choice of the 10% threshold is robust, we repeat the experiment answering RQ2 by slightly modifying it. In Tables 15 and 16, as well as in the bottom section of Table 2, we change the outcome variable from p(>10%) to p(>7.5%) and p(>12.5%), respectively. In other words, we say that a post is part of the top performing posts if it is in the top 7.5% or 12.5% scoring posts.
The tables show that the erratic behavior of β_3 is confirmed in both cases and thus our negative answer to RQ2 is not dependent on the particular threshold we chose.
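Moving the threshold amounts to relabelling which posts count as top performers. The sketch below is one way to build such a binary outcome from raw scores. How ties at the cutoff are handled in the paper is not stated, so here every post scoring at least as high as the cutoff is flagged (our assumption), and the function name is ours.

```python
def top_share_flags(scores, share):
    """Flag posts whose score falls within the top `share` fraction of posts.

    Ties at the cutoff score are all flagged, so the flagged fraction can
    slightly exceed `share` when many posts have the same score.
    """
    if not 0 < share < 1:
        raise ValueError("share must be strictly between 0 and 1")
    if not scores:
        return []
    n_top = max(1, round(len(scores) * share))  # number of posts in the top set
    cutoff = sorted(scores, reverse=True)[n_top - 1]
    return [score >= cutoff for score in scores]


# Toy scores 1..40: the default 10% threshold and the two alternatives.
toy_scores = list(range(1, 41))
for share in (0.075, 0.10, 0.125):
    n_flagged = sum(top_share_flags(toy_scores, share))
    print(f"top {share:.1%}: {n_flagged} posts flagged")
```

The resulting flags would then replace the p(>10%) outcome in the regression, leaving the rest of the model unchanged.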

Negative Binomial Model for Upvote Counts
In this paper, we hypothesize (and find) that receiving minimal attention and becoming especially successful are different outcomes with different drivers. Nonetheless, we might be interested in a summary of the effects of originality and centrality on attention (number of upvotes). To do so, we modify our model. Rather than predicting a probability, such as p(>1), we use the number of upvotes as the outcome variable. The number of upvotes is an over-dispersed count variable (see Figure 3(c) in the main paper). Therefore, the most appropriate model is a negative binomial regression, which is more flexible than a Poisson model. Table 17 shows the result. In this case, the interaction effect is negative for all centrality measures except number of posts. This means that, compared with highly central websites, less central websites gain relatively more upvotes from additional originality.
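A quick way to see why a Poisson model would be too restrictive is to compare the variance of the counts with their mean: a Poisson variable has the two equal, while over-dispersed counts have a much larger variance. The sketch below computes that ratio on made-up upvote counts; it is a diagnostic illustration, not the model fitted in Table 17.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean of a list of counts.

    Values well above 1 suggest over-dispersion relative to a Poisson model.
    """
    average = mean(counts)
    if average == 0:
        raise ValueError("mean of counts is zero; the ratio is undefined")
    return variance(counts) / average


# Hypothetical upvote counts: mostly zeros plus one very popular post.
toy_upvotes = [0, 0, 0, 0, 0, 1, 1, 2, 5, 91]
print(f"variance/mean = {dispersion_ratio(toy_upvotes):.1f}")
```

A negative binomial regression adds a dispersion parameter that lets the variance exceed the mean, which is why it suits long-tailed counts of this kind.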

Alternative Data
Lastly, we address the potential issue that what we are observing might be a Reddit-specific phenomenon. To do so, we collect the US Politics Twitter dataset (https://files.pushshift.io/twitter/US_PoliticalTweets.tar.gz, date of last access: October 19th, 2021). This is a collection of 1.2M tweets sent between August 4th, 2008 and June 6th, 2017. The tweets are all in English from American sources and were collected around a selection of politically-related users. We apply the same data cleaning pipeline we applied to Reddit posts, excluding all the steps related to the web network, which we do not have here.
Figure 4 shows that the Twitter data has hourly and weekly patterns comparable to those of the Reddit data (Figure 3(b) in the main paper), albeit more extreme. This is due to the fact that the Twitter dataset has fewer users and is concentrated in the US. For this reason, the vast majority of tweets are written when it is daytime in the US. Reddit, on the other hand, has a significant user base outside the US, which continuously adds posts during US nighttime, resulting in a smoother distribution of posts over the day, as we show in the main paper.
In this dataset, we can estimate the originality of the text in the same way we estimate the originality of a Reddit title, since the length is comparable. Figure 5(a) shows that the distribution of originality values in Twitter is similar to the one in Reddit (compare with Figure 4(a) in the main paper). Figure 5(b) allows us to compare the tweet length distribution with the Reddit title length distribution we show in Figure 4(b) in the main paper: Twitter lacks the long tail of outliers that Reddit has. Additionally, we report that the median length of a Reddit title is 8 tokens, while for a tweet it is 11.
The success of a tweet is the number of retweets, just as upvotes mark the success of a Reddit post. We use retweets because they perform the same visibility-enhancing function on Twitter as upvotes do on Reddit. An upvote makes a post on Reddit float to the top of the front page, which causes more people to see it. Similarly, a retweet publishes the tweet on more Twitter streams, enhancing its visibility. An alternative could be using likes, but crucially likes do not directly enhance a tweet's visibility in an obvious way like retweets do. Thus, we focus on retweets. Figure 4(c-d) show the distributions of retweets per tweet and per user. A striking difference with the Reddit data is that the odds of a tweet getting retweeted are much higher than the odds of getting one upvote. The probabilities are 86% for the former and 43% for the latter. Compared to Figure 3(c) in the main paper, we see that there is no sharp drop between 0 and 1: the head of the distribution is smoother. The rest of the distribution has a comparable long-tailed shape, although for retweets there is no noticeable exponential cutoff in the tail.
This gap between the probability of getting a retweet and that of getting an upvote on Reddit is especially pronounced for high-follower-count users: users with 100k or more followers have a 97% probability of getting at least one retweet per tweet, while users with fewer than 1k followers have a 67% chance. For this reason, as a first outcome variable, we do not look at the probability of getting one or more retweets (p(>1)) but rather at the probability of getting three or more (p(>3)). Overall, p(>1) is 43% on Reddit and p(>3) is 41% in our Twitter sample, making the two variables comparable.
As a measure of centrality, we use the in-degree of the author of the tweet, i.e., their follower count. We also have the timestamp of the tweet. Twitter has no equivalent of subreddits; one could use hashtag information, but it is unclear whether that would be appropriate. In sum, we are able to estimate Equation 1 in the main paper with one slight modification: we leave out the random effect of subreddit α_s.
Figure 6 shows the originality-centrality interaction effect. Our core result is supported: the slope for p(>3) is higher for low-follower-count users than it is for high-follower-count users. The probability of getting three retweets or more (with the specific controls we chose for the example) goes from 22% to 34% for low-follower users, while it decreases from 81% to 80% for high-follower users. Consistent with what we observe on Reddit, the positive answer to our first research question is confirmed (originality matters more for low-centrality users to avoid failure), and the second answer is still negative. The relationship between originality and centrality for the top 10% scoring posts is unclear: β_3 appears to be positive on Twitter, but it is inconsistently positive, negative or insignificant across all our robustness checks, and thus no real conclusion can be drawn.
Tables 1 and 2, as well as Tables 18 and 19, show the coefficients for the p(>3) and p(>10%) regressions in the Twitter dataset. Since in this dataset we only have a single measure of centrality (the follower count), the tables only have a single column.
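The choice of the cutoff can be checked by computing, for a given platform, the share of posts that reach at least k interactions. The sketch below uses made-up retweet counts. Following the text, p(>3) is read as "three or more". The function name is ours.

```python
def share_at_least(counts, k):
    """Fraction of posts with at least `k` interactions (upvotes or retweets)."""
    if not counts:
        raise ValueError("counts must not be empty")
    return sum(count >= k for count in counts) / len(counts)


# Hypothetical retweet counts for eight tweets.
toy_retweets = [0, 1, 1, 2, 3, 4, 10, 250]
print(f"p(>1) = {share_at_least(toy_retweets, 1):.0%}")
print(f"p(>3) = {share_at_least(toy_retweets, 3):.0%}")
```

Raising k until the share on Twitter is close to the Reddit value of p(>1) mirrors the reasoning that leads to p(>3) as the comparable outcome.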

Network Variance Intuition
In the main paper, we show that the network variance of originality is lower than the one we would expect if originality were distributed randomly in the network of websites. We use network variance because the simple correlation of degree and originality is not accurate enough to distinguish different diffusion scenarios. In this section, we provide simple toy examples with two aims. First, we need to support our statement that there are different ways of distributing originality that generate similar correlations with the degree, but that are intuitively different. Second, we show the behavior of network variance, to confirm that it corresponds to our intuition.
Consider Figure 7. In Figure 7(a) the high degree, highly original nodes are somewhat scattered in the network. On the other hand, in Figure 7(b) they concentrate in the core. We would thus expect to find a difference between these networks: originality is more diffused in Figure 7(a) and more concentrated in Figure 7(b). However, the correlation between originality and degree is similar. It is equal to 0.96 and 0.98 in Figures 7(a) and 7(b), respectively. Thus this simple correlation is not sensitive to the differences in these structures. On the other hand, the network variances of originality are 1.29 and 0.43 in Figures 7(a) and 7(b), respectively. This shows that network variance is closer to our intuition when it comes to evaluating how dispersed originality is in the network.

Full Regression Tables
In all tables in this section, we report the standard errors of the estimates in parentheses below the coefficient value. Each row of the top half of the table reports a different β coefficient, while each column reports the coefficient values for a collection of alternative related models. In the bottom half of the table we report information about the fixed effects (for which we omit the coefficients, since there are too many), as well as statistics about the goodness of fit of the model. We again mark the statistical significance using a classical star notation: * for p < 0.1, ** for p < 0.05, and *** for p < 0.01. log(posts), log(indegree + 1), log(betwenness), and log(pagerank) are alternative definitions for domain centrality, C_d, with log(indegree + 1) being the main one. Thus, the model of reference is the one in the columns marked as (2). The coefficient of interest is the interaction between centrality and originality (β_3), reported in the O_T:* rows, with * being one of the alternative C_d measurements.